Problem statement

Background

As living standards have improved, many people now consider traveling their first choice for spending spare time. Traveling with family and friends should be an unforgettable experience, but we often hear bad news about property loss, injury, or even death during trips. Although being careful is important for avoiding danger while traveling, choosing a proper destination, safe transportation, and a reliable agency can also be a good way to protect ourselves.

So what factors might cause accidents during travel?

Proposed Solution

Apply Bayesian analysis to a travel insurance dataset.

Dataset

Travel insurance dataset from Kaggle, provided by a third-party insurance servicing company based in Singapore.

##      Agency             Agency.Type    Distribution.Channel
##  EPX    :35119   Airlines     :17457   Offline: 1107       
##  CWT    : 8580   Travel Agency:45869   Online :62219       
##  C2B    : 8267                                             
##  JZI    : 6329                                             
##  SSI    : 1056                                             
##  JWT    :  749                                             
##  (Other): 3226                                             
##                           Product.Name   Claim          Duration      
##  Cancellation Plan              :18630   No :62399   Min.   :  -2.00  
##  2 way Comprehensive Plan       :13158   Yes:  927   1st Qu.:   9.00  
##  Rental Vehicle Excess Insurance: 8580               Median :  22.00  
##  Basic Plan                     : 5469               Mean   :  49.32  
##  Bronze Plan                    : 4049               3rd Qu.:  53.00  
##  1 way Comprehensive Plan       : 3331               Max.   :4881.00  
##  (Other)                        :10109                                
##     Destination      Net.Sales       Commision..in.value.  Gender     
##  SINGAPORE:13255   Min.   :-389.00   Min.   :  0.00       F   : 8872  
##  MALAYSIA : 5930   1st Qu.:  18.00   1st Qu.:  0.00       M   : 9347  
##  THAILAND : 5894   Median :  26.53   Median :  0.00       NA's:45107  
##  CHINA    : 4796   Mean   :  40.70   Mean   :  9.81                   
##  AUSTRALIA: 3694   3rd Qu.:  48.00   3rd Qu.: 11.55                   
##  INDONESIA: 3452   Max.   : 810.00   Max.   :283.50                   
##  (Other)  :26305                                                      
##       Age        
##  Min.   :  0.00  
##  1st Qu.: 35.00  
##  Median : 36.00  
##  Mean   : 39.97  
##  3rd Qu.: 43.00  
##  Max.   :118.00  
## 

In total: 63,326 records, 11 variables.
Target: Claim status (Yes/No). The claim status of a policy can indicate whether the customer encountered an accident during travel.
Features: Agency, Agency type, Distribution channel, Product name, Duration, Destination, Net sales, Commission, Gender, Age.

Data preprocessing

Check for null values

Too many empty values in Gender, so we drop the column directly.

## [1] 45107
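The check and drop above can be sketched as follows; this is a Python/pandas illustration with a hypothetical toy frame (the report's own analysis is in R):

```python
import pandas as pd

# Toy stand-in for the real dataset: Gender is mostly missing.
df = pd.DataFrame({
    "Gender": ["F", None, None, "M", None],
    "Age": [36, 35, 43, 36, 39],
})

n_missing = df["Gender"].isna().sum()  # the report prints this count (45107)
if n_missing / len(df) > 0.5:          # mostly empty -> drop the column directly
    df = df.drop(columns=["Gender"])
```

The 50% threshold here is an arbitrary illustration; in the real data 45,107 of 63,326 Gender values are missing, which clearly justifies the drop.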

Drop outliers for the continuous variables:

Age

Duration

Commission

Net.Sales

##      Agency             Agency.Type    Distribution.Channel
##  EPX    :34382   Airlines     :17072   Offline: 1077       
##  C2B    : 8080   Travel Agency:43692   Online :59687       
##  CWT    : 7220                                             
##  JZI    : 6175                                             
##  SSI    : 1046                                             
##  JWT    :  740                                             
##  (Other): 3121                                             
##                           Product.Name   Claim          Duration      
##  Cancellation Plan              :18212   No :59840   Min.   :  -2.00  
##  2 way Comprehensive Plan       :12907   Yes:  924   1st Qu.:   9.00  
##  Rental Vehicle Excess Insurance: 7220               Median :  22.00  
##  Basic Plan                     : 5352               Mean   :  48.94  
##  Bronze Plan                    : 3967               3rd Qu.:  52.00  
##  1 way Comprehensive Plan       : 3263               Max.   :4881.00  
##  (Other)                        : 9843                                
##     Destination      Net.Sales        Commision            Age        
##  SINGAPORE:12958   Min.   :  0.07   Min.   :  0.000   Min.   :  0.00  
##  THAILAND : 5735   1st Qu.: 19.80   1st Qu.:  0.000   1st Qu.: 35.00  
##  MALAYSIA : 5643   Median : 28.00   Median :  0.000   Median : 36.00  
##  CHINA    : 4675   Mean   : 43.10   Mean   :  9.346   Mean   : 39.98  
##  INDONESIA: 3393   3rd Qu.: 49.50   3rd Qu.: 10.500   3rd Qu.: 43.00  
##  AUSTRALIA: 3316   Max.   :810.00   Max.   :283.500   Max.   :118.00  
##  (Other)  :25044
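The report does not state the exact outlier rule it applied; a common choice is the 1.5×IQR fence, sketched here in Python on a hypothetical Duration column:

```python
import pandas as pd

def drop_outliers_iqr(df, col):
    """Keep only rows whose value lies within 1.5 * IQR of the quartiles."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    return df[df[col].between(q1 - fence, q3 + fence)]

# Toy Duration values: 400 sits far outside the quartile fences.
df = pd.DataFrame({"Duration": [9, 22, 30, 40, 52, 400]})
clean = drop_outliers_iqr(df, "Duration")
```

The same function would be applied once per continuous column (Age, Duration, Commission, Net.Sales).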

Label encoding

For the target “Claim”, we do label encoding: “Yes” to 1, “No” to 0.

## 
##     0     1 
## 59840   924
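This encoding is a simple mapping; a Python sketch with a toy column (the report uses R):

```python
import pandas as pd

df = pd.DataFrame({"Claim": ["No", "No", "Yes", "No"]})   # toy target column
df["Claim"] = df["Claim"].map({"Yes": 1, "No": 0})        # the encoding used above
counts = df["Claim"].value_counts()                       # tabulate 0s and 1s
```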

EDA

For the target “Claim”:

## [1] "Yes:"
## [1] 0.01520637
## [1] "No:"
## [1] 0.9847936

There are two levels: 62,399 (98.5%) “No” and 927 (1.5%) “Yes”. The classes are imbalanced, so we tried upsampling, but the performance became worse.
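One common upsampling scheme, resampling the minority class with replacement until the classes balance (an assumption about the exact scheme used here), looks like this in Python:

```python
import pandas as pd

# Toy imbalanced target: 9 "No" (0) vs 1 "Yes" (1).
df = pd.DataFrame({"x": range(10), "Claim": [0] * 9 + [1]})
majority = df[df["Claim"] == 0]
minority = df[df["Claim"] == 1]

# Draw minority rows with replacement until both classes are the same size.
upsampled = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=0),
])
balance = upsampled["Claim"].value_counts()
```

Because the duplicated minority rows carry no new information, an upsampled fit can easily look worse out of sample, consistent with what we observed.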

For the categorical features:

## [1] 16
## [1] 2
## [1] 147
## [1] 2
## [1] 25

“Agency”: 16 levels, “Destination”: 147 levels, “Product name”: 25 levels.

There are too many levels in “Agency”, “Destination”, and “Product name”, which would generate too many features after one-hot encoding.
“Destination”: 147 levels → top 10 levels + 1 “Others” level
“Agency”: 16 levels → keep
“Product name”: dropped, due to its correlation with “Agency”

## [1] 10
## [1] 147
## [1] 137
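Collapsing “Destination” to its most frequent levels plus an “Others” bucket can be sketched as follows (Python, with a hypothetical toy column; the real data keeps the top 10 levels instead of 2):

```python
import pandas as pd

# Toy Destination column with two frequent levels and two rare ones.
dest = pd.Series(["SINGAPORE"] * 5 + ["MALAYSIA"] * 3 + ["FIJI", "PERU"])
top = dest.value_counts().nlargest(2).index       # nlargest(10) in the real data
collapsed = dest.where(dest.isin(top), other="Others")
```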

Feature Selection

As there are many categorical variables in our dataset, and considering computing power, we use Lasso for feature selection and keep the top 15 most important features.

##  [1] "AgencyC2B"                  "Distribution.ChannelOnline"
##  [3] "AgencyLWC"                  "AgencyTST"                 
##  [5] "AgencyKML"                  "AgencyRAB"                 
##  [7] "AgencySSI"                  "AgencyCBH"                 
##  [9] "AgencyCSR"                  "AgencyCWT"                 
## [11] "DestinationSINGAPORE"       "AgencyJWT"                 
## [13] "AgencyCCR"                  "DestinationMALAYSIA"       
## [15] "DestinationPHILIPPINES"
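The Lasso step can be sketched as an L1-penalized logistic regression whose nonzero coefficients rank the features; this scikit-learn version on synthetic data is an illustration only (the report does not state which implementation it used):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Only the first two columns actually drive the (synthetic) outcome.
y = (2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# The L1 penalty shrinks unimportant coefficients toward zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
ranked = np.argsort(-np.abs(lasso.coef_[0]))  # features by |coefficient|
top_features = set(ranked[:2])                # keep the top k (15 in the report)
```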

Robust Logistic regression

We use JAGS to run a robust logistic regression with target “Claim”. First, we one-hot encode the categorical features and keep the top 15 columns selected by Lasso. Then we split the data into training (90%) and testing (10%) sets. We also check the correlations between features: only AgencyC2B and DestinationSINGAPORE show a somewhat higher correlation. In practice, all AgencyC2B clients travel to Singapore, but clients traveling to Singapore do not all come from AgencyC2B, so we decide to keep both features and examine the result.

## 'data.frame':    60764 obs. of  16 variables:
##  $ AgencyC2B                 : num  1 1 1 1 1 0 0 0 0 0 ...
##  $ Distribution.ChannelOnline: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ AgencyLWC                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgencyTST                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgencyKML                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgencyRAB                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgencySSI                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgencyCBH                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgencyCSR                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgencyCWT                 : num  0 0 0 0 0 1 1 1 1 1 ...
##  $ DestinationSINGAPORE      : num  1 1 1 1 1 0 0 0 0 0 ...
##  $ AgencyJWT                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AgencyCCR                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DestinationMALAYSIA       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DestinationPHILIPPINES    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Claim                     : num  0 0 1 0 0 0 0 0 0 0 ...
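The 90/10 split and the correlation check described above can be sketched like this (Python, with a hypothetical toy frame mirroring the two correlated columns):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy one-hot frame: AgencyC2B rows are all Singapore-bound, but not vice versa.
df = pd.DataFrame({
    "AgencyC2B":            [1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "DestinationSINGAPORE": [1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
    "Claim":                [0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
})

# 90% training / 10% testing, as in the report.
train, test = train_test_split(df, test_size=0.1, random_state=0)

# Pairwise correlation between the two flagged features.
r = df.drop(columns="Claim").corr().loc["AgencyC2B", "DestinationSINGAPORE"]
```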

Finally, we fit the robust logistic regression using a 2,000-iteration burn-in, 4 chains, and 3,750 iterations. The posterior distribution of the guessing parameter shows that its value is very small, which means the model behaves much like an ordinary logistic regression. However, it is significant: 0 is not in the HDI, so we should not ignore the guessing parameter.

From the posterior distributions, we can see that the betas of 7 features are not significant; in other words, the HDIs of their distributions contain 0, so these betas may equal zero, meaning these features may not be significant for explaining the target “Claim”. Agencies with a positive mode, such as AgencyCWT, indicate that traveling with that agency is associated with a higher chance of claiming travel insurance. Destinations with a negative mode indicate that travel to those countries is relatively safe: insurance claims seldom happen there.

Diagnosis analysis

From each beta’s diagnostics, we can see that beta0 and beta5 converge very well and have good effective sample sizes (ESS), which means these two betas are very stable.

Looking at beta1 and beta11, we can see that these two betas converge poorly: the trace plots are sticky, the autocorrelations are very high, and the ESS values are very low. This is likely caused by the high correlation between the two corresponding features.

Performance Evaluation

guessing parameter

Finally, we want to check the performance of the MCMC result. We use the 8 significant variables from the MCMC to train the model and compare the results with and without the guessing parameter. To improve accuracy when training, we also oversample to balance the target.

## 
## Call:
## glm(formula = Claim ~ ., family = "binomial", data = train_pred)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1754  -0.8036  -0.5535   0.6158   1.9758  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.04448    0.05815   17.96   <2e-16 ***
## AgencyC2B                   3.36540    0.04182   80.47   <2e-16 ***
## Distribution.ChannelOnline -2.00928    0.05840  -34.41   <2e-16 ***
## AgencyLWC                   2.46889    0.04934   50.04   <2e-16 ***
## AgencyTST                  -1.70233    0.11253  -15.13   <2e-16 ***
## AgencyKML                   1.22307    0.07241   16.89   <2e-16 ***
## AgencyCWT                   0.67507    0.02279   29.61   <2e-16 ***
## DestinationSINGAPORE       -0.83394    0.04080  -20.44   <2e-16 ***
## DestinationMALAYSIA        -0.81540    0.03481  -23.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 145134  on 104691  degrees of freedom
## Residual deviance: 111601  on 104683  degrees of freedom
## AIC: 111619
## 
## Number of Fisher Scoring iterations: 4
##     predict
## real    0    1
##    0 6365 1128
##    1   48   55
## [1] 0.8451817

##     predict
## real    0    1
##    0 6365 1128
##    1   48   55
## [1] 0.8451817
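The printed accuracy is simply (TN + TP) / total from the confusion matrix above:

```python
# Confusion matrix entries from the output above (rows: real, cols: predicted).
tn, fp = 6365, 1128
fn, tp = 48, 55
accuracy = (tn + tp) / (tn + fp + fn + tp)  # 6420 / 7596 = 0.8451817...
```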

From the results we can see that the two models are almost the same. As we know, when alpha equals 0 the model is non-robust, i.e., ordinary logistic regression; when alpha equals 1 the model is a horizontal line with y-intercept 1/2. This means our model is very close to a non-robust model, likely because the value of our guessing parameter is so small that it has almost no influence on the model.
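The mixture described here is p = alpha * 1/2 + (1 - alpha) * logistic(x'beta), the usual guessing-parameter formulation of robust logistic regression. A quick numeric sketch in Python (the beta values are hypothetical):

```python
import math

def robust_logistic(x, beta0, beta1, alpha):
    """Guessing mixture: with probability alpha the response is a coin flip."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return alpha * 0.5 + (1.0 - alpha) * p

# alpha = 0 recovers ordinary (non-robust) logistic regression ...
p_nonrobust = robust_logistic(2.0, -1.0, 1.5, alpha=0.0)
# ... alpha = 1 is the flat line at 1/2, regardless of x ...
p_flat = robust_logistic(2.0, -1.0, 1.5, alpha=1.0)
# ... and a tiny alpha, like our fitted guessing parameter, barely moves the curve.
p_small = robust_logistic(2.0, -1.0, 1.5, alpha=0.01)
```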

multicollinearity

From the correlation plot, we find that the correlation between AgencyC2B and DestinationSINGAPORE is the highest. The previous MCMC also showed that the betas of these two features are not stable, as seen in the four diagnostic plots. Therefore, we decide to drop one of them to see how multicollinearity influences sampling stability, keeping DestinationSINGAPORE in the next round. After rerunning the MCMC, even though the chains still do not converge very well, the new diagnostics for DestinationSINGAPORE improve substantially: the ESS increases and the MCSE decreases.

Conclusion

In conclusion, a non-robust logistic regression is good enough for our project. In the MCMC, only 8 of the 15 variables are significant. When highly correlated variables are present in a dataset, they impact the convergence of the MCMC samples.